Reproducible data analysis

Guido Biele

Reproducible research in context

Providing a reproducible analysis is the most important and easiest aspect of open science

  • data cannot always be easily shared
  • analysis connects data and results
  • confirming pre-registered hypotheses is less valuable without it
  • pre-registration is often not possible or useful

Why reproducible research?

  • builds trust
  • reduces errors
  • makes it easier to write the method section
  • streamlines manuscript writing

Levels of reproducible data analysis

  1. Script all analysis steps
  2. Version control software & public repositories for publication
  3. Scientific and technical publishing system for documenting and implementing analysis pipeline
  4. Scientific and technical publishing system for writing papers

Tools for reproducible data analysis

  • R, or another programming language
  • RStudio, or other IDEs with git/github integration
  • git, or other version control software
  • R Markdown, or another scientific and technical publishing system

1. Scripting

Different scripts for different steps (Example)

  • utility functions
  • data cleaning and preparation
  • running analyses
  • preparing analysis results for the manuscript
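
A minimal sketch of how the steps above could be tied together by one master script. The file names and the `run_pipeline()` helper are hypothetical and only illustrate the idea of one script per step.

```r
# Hypothetical file names: one script per analysis step, listed in run order.
steps <- c(
  "00_utils.R",            # utility functions
  "01_clean_data.R",       # data cleaning and preparation
  "02_run_analyses.R",     # running analyses
  "03_prepare_results.R"   # preparing results for the manuscript
)

# Source each script in order; stop early if one is missing.
# `env` is the environment in which the scripts are evaluated.
run_pipeline <- function(steps, dir = ".", env = globalenv()) {
  for (s in steps) {
    path <- file.path(dir, s)
    if (!file.exists(path)) stop("Missing script: ", path)
    message("Running ", s)
    source(path, local = env)
  }
  invisible(steps)
}

# Usage (assuming the scripts exist in the project folder):
# run_pipeline(steps)
```

Numbered prefixes keep the run order visible in the file browser, and a single entry point lets a reader rerun everything with one call.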

Working around slow analysis parts

Save intermediate steps

  • pre-processing
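  • slow model estimation (MCMC)

fn <- "my_analysis_results.Rdata"

if (file.exists(fn)) {
  # reuse the saved result; load() restores the object my_fit
  load(fn)
} else {
  # run the slow analysis here and store its result in my_fit
  save(my_fit, file = fn)
}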
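
The same idea can be wrapped in a small helper. This is a sketch under assumptions: the `cached()` function is hypothetical, it uses `saveRDS()`/`readRDS()` instead of `save()`/`load()`, and a quick `lm()` fit on the built-in `cars` data stands in for a slow MCMC model.

```r
# Hypothetical helper: return the stored result if `file` exists,
# otherwise run `compute()`, store its result, and return it.
cached <- function(file, compute) {
  if (file.exists(file)) {
    readRDS(file)
  } else {
    result <- compute()
    saveRDS(result, file)
    result
  }
}

# Example with a fast stand-in for a slow model (built-in `cars` data).
fit_file <- file.path(tempdir(), "my_fit.rds")
my_fit <- cached(fit_file, function() lm(dist ~ speed, data = cars))
```

`saveRDS()` stores a single object, so the name of the result is chosen where it is read rather than fixed by the file. Deleting the file forces the model to be estimated again.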

2. Version control

Start each paper by setting up a (git/github) project

  • makes it easy to go back to older versions of an analysis
  • serves as a backup
  • allows several people to work jointly on a project
  • the IDE is well integrated with git/github (show)
  • first step to later publishing the code
  • Hard: Sufficient & continuous code documentation
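
One way to connect results to a specific version of the code is to store the current commit hash next to them. The sketch below is an assumption, not part of the talk: `current_commit()` is a hypothetical helper that calls the `git` command line tool and returns `NA` when git is unavailable or the folder is not a repository.

```r
# Hypothetical helper: ask git for the hash of the current commit.
# Returns NA_character_ if git is missing or this is not a repository.
current_commit <- function() {
  out <- tryCatch(
    system2("git", c("rev-parse", "HEAD"), stdout = TRUE, stderr = FALSE),
    warning = function(w) character(0),
    error = function(e) character(0)
  )
  if (length(out) == 1) out else NA_character_
}

# Keep the hash together with the date of the run.
run_info <- list(commit = current_commit(), date = Sys.Date())
```

Saving `run_info` alongside the analysis results makes it possible to check out exactly the code that produced them.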

3. Analysis pipeline in R Markdown

  • R Markdown document for
    • data cleaning
    • statistical analysis
    • supplementary analyses & results
    • generation of statistics for paper
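
A minimal sketch of such a document, written from R so that the example is self-contained. The file name, chunk names, and the toy analysis on the built-in `cars` data are assumptions. Rendering is attempted only if the rmarkdown package and pandoc are available.

```r
# Build the chunk delimiter programmatically (three backticks).
fence <- strrep("`", 3)

# Lines of a small R Markdown document: cleaning, analysis, inline statistic.
rmd <- c(
  "---",
  "title: \"Analysis pipeline\"",
  "output: html_document",
  "---",
  "",
  "# Data cleaning",
  paste0(fence, "{r clean}"),
  "d <- subset(cars, speed > 5)",
  fence,
  "",
  "# Statistical analysis",
  paste0(fence, "{r fit}"),
  "fit <- lm(dist ~ speed, data = d)",
  fence,
  "",
  "The estimated slope is `r round(coef(fit)[2], 2)`."
)

rmd_file <- file.path(tempdir(), "analysis.Rmd")
writeLines(rmd, rmd_file)

# Render only when the tools are installed.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```

The inline expression in the last line of the document is what keeps reported statistics in sync with the analysis: the number is computed when the document is rendered, not typed by hand.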

4. Writing papers with papaja

papaja is available on CRAN & github

  • APA-6 compliant manuscripts via R Markdown
  • GUI to search references (based on bibtex files)
  • automatic references to figures & tables in the supplement
  • APA styles for tables and figures
  • outputs MS Word or PDF files
  • Upfront investment, but provides huge benefits!
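
A small sketch of how papaja can turn a fitted model into APA-style text. The `lm()` fit on the built-in `cars` data is an assumed stand-in for a real analysis, and the code skips the formatting step if papaja is not installed.

```r
# Toy model standing in for the real analysis.
fit <- lm(dist ~ speed, data = cars)

if (requireNamespace("papaja", quietly = TRUE)) {
  # apa_print() returns formatted estimates, test statistics,
  # complete result strings, and a results table.
  res <- papaja::apa_print(fit)
  # The entries of res$full_result are meant for inline use in the manuscript.
  print(names(res$full_result))
} else {
  message("papaja is not installed; skipping the formatting step")
}
```

In a papaja manuscript the output format is set in the YAML header, for example `papaja::apa6_pdf` for PDF or `papaja::apa6_docx` for MS Word.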

Example

Things I didn’t talk about

Summary

Why reproducible analysis?

  • makes good research easier
    • fewer errors
    • faster manuscripts
    • reviewers appreciate it
  • reliability & trust
  • lead by example

Why not?

  • Start-up investment
  • can’t hide parts of the analysis
  • You want to write a couple of papers as fast as possible to finally become a PI